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Summary 

Viruses have global impact through mortality, nutrient 
cycling and horizontal gene transfer, yet their study is 
limited by complex methodologies with little valida- 
tion. Here, we use triplicate metagenomes to compare 
common aquatic viral concentration and purification 
methods across four combinations as follows: (i) tan- 
gential flow filtration (TFF) and DNase + CsCI, (ii) 
FeCIs precipitation and DNase, (IN) FeCIs precipitation 
and DNase + CsCI and (iv) FeCIs precipitation and 
DNase + sucrose. Taxonomic data (30% of reads) sug- 
gested that purification methods were statistically 
indistinguishable at any taxonomic level while con- 
centration methods were significantly different 
at family and genus levels. Specifically, TFF- 
concentrated viral metagenomes had significantly 
fewer abundant viral types (Podoviridae and Phycod- 
naviridae) and more variability among Myoviridae 
than FeCls-precipitated viral metagenomes. More 
comprehensive analyses using protein clusters (66% 
of reads) and k-mers (1 00% of reads) showed 50-53% 
of these data were common to all four methods, and 
revealed trace bacterial DNA contamination in TFF- 
concentrated metagenomes and one of three repli- 
cates concentrated using FeCIs and purified by DNase 
alone. Shared k-mer analyses also revealed that poly- 
merases used in amplification impact the resulting 
metagenomes, with TaKaRa enriching for 'rare' reads 
relative to PfuTurbo. Together these results provide 
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empirical data for making experimental design deci- 
sions in culture-independent viral ecology studies. 



Introduction 

Viruses are the most abundant and diverse biological 
entities on the planet (Wommack and Colwell, 2000; 
Rohwer, 2003). Their impact is global: affecting microbial 
hosts through mortality, remineralization of nutrients and 
horizontal gene transfer (reviewed in Fuhrman, 1999; 
2000; Wommack and Colwell, 2000; Weinbauer, 2004; 
Suttle, 2005; 2007; Breitbart etal., 2007). They can even 
drive the evolutionary trajectory of the Earth's fundamen- 
tal biogeochemical processes by encoding 'host' genes 
(Mann etal., 2003; Lindell etal., 2004; Sullivan etal., 
2005; 2006; Sharon etal., 2007) that are expressed 
during infection (Lindell etal., 2005; Clokie etal, 2006; 
Dammeyer etal., 2008) and confer a direct fitness advan- 
tage for the phage (Bragg and Chisholm, 2008; Hell- 
weger, 2009). Such viral-encoded photosynthesis genes 
likely also have significant ecological and evolutionary 
impact on ocean ecosystems as they dominate global 
ocean microbial metagenomes (Sharon etal., 2007) and 
alter the long-term evolutionary trajectories of both phage 
and host copies of the gene (Sullivan etal., 2006). 

Despite the abundance of viruses in aquatic marine 
environments, studying viruses in the wild is fraught with 
methodological challenges. Perhaps the most notable is 
that wild viruses and their microbial hosts are rarely cul- 
tivable (Edwards and Rohwer, 2005) and the field is 
reliant upon culture-independent methods such as 
metagenomics wherein the majority of reads (50-90%) 
show no significant similarity to a sequence within a 
known organism (Breitbart etal., 2002; Angly etal., 2006; 
Bench etal., 2007; Dinsdale etal., 2008; Williamson 
etal., 2008). Furthermore, to create metagenomes, viral 
particles must be concentrated and purified from small 
volumes of filtrate while minimizing contamination and 
artefact (Lawrence and Steward, 2010; Wommack etal., 
2010). We previously introduced a new method, FeCIa 
precipitation, for concentrating viruses from seawater 
(John etal., 2011) that was more efficient than the stan- 
dard method of tangential flow filtration (TFF). 
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While little is known about the effects of sample process- 
ing and library construction procedures on viral metage- 
nomes several studies have addressed experimental 
issues in microbial metagenomics. First, variable DNA 
extraction efficiencies across microbes in the environment 
are likely due to variation in cell wall and membrane 
structure (Carrigg eta!., 2007). Second, amplification pro- 
tocols used to bolster limiting quantities of DNA, such as 
multiple displacement amplification (MDA), are not quan- 
titative due to amplification bias (Yilmaz et a!., 201 0). Third, 
fosmid cloning can also bias the representation of species 
(Temperton etal., 2009). Finally, in silico processing is 
problematic where organisms are more divergent than 
those in databases (McHardy and Rigoutsos, 2007). Sum- 
marily, even in the case of in vitro metagenomic simulations 
where both the organisms and abundances were known a 
priori, sequence coverage variation was observed, and 
ascribed to differences in 'growth conditions, organismal 
growth phase, DNA extraction efficiency, cloning bias, 
sequencing efficiency, or relative genome copy number' 
(Morgan et a!., 2010). Thus rigorous, systematic, empirical 
studies are needed to take us one step closer to being able 
to meaningfully cross-compare metagenomic data sets 
generated using different methods. 

Here, we evaluate the effect of commonly used aquatic 
viral concentration and purification protocols on triplicate 
metagenomes across four different method combinations. 
Specifically, seawater collected from the site of the f i rst vi ral 
metagenomes, San Diego's Scripps Pier (Breitbart etal., 
2002), was used to generate triplicate viral metagenomes 
to evaluate taxonomic and protein cluster variability across 
two concentration (TFF and FeCIs precipitation) and three 
purification methods (DNase only, DNase + CsCI and 
DNase + sucrose) in four method combinations as follows: 
(i) TFF and DNase + CsCI, (ii) FeCb precipitation and 
DNase only, (iii) FeCb precipitation and DNase + CsCI and 
(iv) FeCIa precipitation and DNase + sucrose. 

Results 

Thirteen metagenomes were generated for this study 
from Scripps Pier, San Diego, California Pacific Ocean 
seawater: a microbial metagenome from the 0.2-2.7 |.im 
size fraction, and 12 viral metagenomes derived from the 
< 0.2 \im size fraction. The viral metagenomes were pro- 
duced using four standard protocols for viral concentra- 
tion and purification (TFF DNase + CsCI, FeCIs DNase 
only, FeCIs DNase + CsCI and FeCIs DNase + sucrose) in 
triplicate from the same seawater sample (Fig. 1). Tripli- 
cate metagenomes were used to rigorously evaluate 
intra- versus /nfer-method variability and document any 
taxon-specific or protein diversity biases associated with 
these methods. The replicated metagenomes are abbre- 
viated as TCI , TC2, TC3 (TC; TFF DNase + CsCI), FD1 , 




Fig. 1. Workflow showing the process of creating each viral 
metagenomic replicate using different concentration (TFF or FeCIs) 
and purification methods (DNase only, DNase + CsCI or 
DNase + sucrose) from a 21 0 I sample of seawater taken from 
Scripps Pier, San Diego, CA. The '3 x' refers to that each 
subsample was processed independently from the 210 I pooled 
initial sample. 

FD2, FD3 (FD; FeCb DNase only), FC1, FC2, FC3 (FC; 
FeCIs DNase + CsCI) and FS1, FS2, FS3 (FS; FeClg 
DNase + sucrose) for simplicity. 



Community composition and variability across methiods 

A one-way analysis of variance was conducted to evaluate 
the relationship between viral concentration and purifica- 
tion methods based on the hit count at the superkingdom, 
family and genus levels (see Experimental procedures. 
Table SI). At the broadest taxonomic level (superking- 
dom), we observed no significant differences across con- 
centration or purification methods other than slightly more 
variability between replicate metagenomes forTC and FS 
(Fig. SI, Table SI). Overall, 66-81% of the reads in viral 
metagenomes and 37% in the microbial metagenome had 
no significant hit to anything in available databases (see 
Experimental procedures). Of the subset of reads from 
viral metagenomes with significant hits, 44-63% had no 
superkingdom designation in the database, little to no 
reads were classified as 'Archaea', 1 1 -1 4% were classified 
as 'Bacteria', 2-3% as 'Eukaryota' and 22-42% as 
'Viruses'. Notably, these percentages include a small cor- 
rection (- 1 %) to reclassify abundant 'prophage' and 'AMG' 
(auxiliary metabolic gene) sequences that get misclassi- 
fied as 'bacterial' (see Experimental procedures). As 
expected, these superkingdom-level taxonomic findings 
contrasted those from the microbial metagenome where 
the bulk of the identified reads were 'Bacteria' (47%) and 
many fewer were 'Viruses' (3%), suggesting that the 
metagenomes from viral-size-fractionated seawater were 
indeed enriched for viruses regardless of concentration or 
purification method. 

With increasing taxonomic resolution at the family and 
genus levels respectively (Fig. 2), we found relatively con- 
sistent rank abundance profiles across all methods for the 
top 10 taxonomic designations. These top 10 taxonomic 
distributions represent 76-87% of the total reads that had 
a taxonomic designation at the family level and 52-72% at 
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Fig. 2. Rank abundance curve of (A) 
family-level (B) and genus-level taxonomy for 
the top 10 taxa observed in the data. Only 
four family-level and four genus-level taxa are 
viral (highlighted with red text); the remaining 
are microbial taxa. 



the genus level, thus capturing a sizable fraction of our 
annotated data set. Increasing the taxa to the top 30 only 
negligibly added more reads. We found significant differ- 
ences in the absolute count for the more abundant viral 
taxa at the family and genus levels (Table SI). With 
respect to the concentration method, metagenomes 
derived from TC viruses significantly underrepresented 
the Podo- and Phycodna- (Prasino- and Chloroviruses) 
viridae relative to FD, FC and FS viral concentrates. Also, 
the TC samples showed more variability in the absolute 
counts for the most abundant viral type, Myoviridae (T4- 
like viruses), as compared with FD, FC and FS (Table S1 ). 
The choice of purification method (FD, FC and FS), 
however, did not have a significant effect on the distribu- 
tion of hits at any taxonomic level. The most notable 
difference was that Rhodobacteriaceae was more 



variable in the FD replicates as compared with FC and FS 
viral concentrates. 

Diversity of protein clusters 

Given limited available annotation, we next explored a 
greater fraction of our data using protein clusters to esti- 
mate sequencing effort and protein diversity across 
methods (see Experimental procedures). When we clus- 
tered the - 1.8 million total open reading frames (ORFs) 
from all of the viral methods, we found that 27% clustered 
with existing protein clusters from the Global Ocean 
Survey (GOS), 12% clustered in novel protein clusters 
with 20 -H members, 40% with 2-19 members and 21% 
could not be placed in protein clusters at all. Our final set 
of 20 + high-confidence protein clusters consisted of 6845 
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concentration and purification metfiod. 
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clustering analysis and showed that 50% of reads are 
shared between all methods and 82% are shared with at 
least one other method. Also comparably, 1 2% were found 
in justTC and 17% in FeCls-precipitated samples (FD, FC 
and FS), indicating that each subset of reads may be 
specific to each concentration protocol as noted previously. 

To explore what drives the differences in the rarefaction 
curves we examined the fraction of 'rare' sequences 
(k-mer= 1; Fig. 4B) in the metagenomes, and found con- 
siderable variation both within and across methods. This 
variability appears to be driven by the enzyme used in 
linker amplification, as samples using the TaKaRa poly- 
merase had a higher percentage of rare sequences 
(> 13%) than those using PfuTurbo (< 1%). This was 
especially apparent in FD2, which was prepared twice for 
sequencing, once with PfuTurbo and a second time with 
TaKaRa and then combined into one sample. Here, the 



GOS protein clusters and 6178 novel protein clusters. 
When we mapped our reads back to the ORFs in high- 
confidence protein clusters 66% of reads matched. 

To test the 'replicability' of each method, we examined 
the ORF membership of each of the 20 + protein clusters 
with the null hypothesis that each method would be rep- 
resented by at least one ORF in these abundant protein 
clusters. Overall, we found that 53% of clusters contained 
ORFs from all four methods (Table 1). Of the remaining 
clusters, 13% were found in TFF-concentrated samples 
(TO) and 11% in FeCls-precipitated samples (FD, FC and 
FS) only and may represent clusters specific to the con- 
centration methods. In total, 82% of clusters contained 
ORFs from at least two methods indicating that most 
clusters were not specific to a single method and are likely 
to represent real viral proteins in the sample rather than 
artefact. 

We then explored the data from the high-confidence 
protein clusters (GOS clusters + novel clusters with 20 + 
members) using a rarefaction analysis to examine protein 
diversity in each sample and replicate. These analyses 
suggested that our sampling was relatively deep, but that 
four samples (TCI, TC2, TC3 and FD2) had greater 
protein diversity than other methods (Fig. 3A). 

Siiared l<-mer analysis of reads to distinguish metliods 
and replicates 

In order to better quantify and separate out distinct reads in 
each method or replicate that drive differences in the 
rarefaction curves (above), we compared and contrasted 
k-mers in reads for all samples representing 100% of our 
data set (Fig. 4A; see Experimental procedures). Overall, 
the k-mer analysis mirrored the results from the protein 
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Fig. 3. Rarefaction analysis of hits to protein clusters from each 
viral metagenome using (A) all sequences and (B) abundant 
(k-mer > 1 ) sequences. To be conservative, only protein clusters 
with > 20 members were used in these analyses. 
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Fig. 4. K-mer-based analysis of shared reads 
between methods and replicates. (A) 
Percentage of total reads that are shared 
between methods based on a l<-mer-based 
analysis and (B) percentage of 'rare' 
(k-mer= 1) versus abundant sequences 
(k-mer > 1) in each of the four viral 
metagenomic methods and a non-replicated 
microbial metagenome. The enzymes used 
for linker amplification (TaKaRa or PfuTurbo) 
are listed above each sample. The microbial 
sample includes more 'rare' sequences 
because the diversity in the sample is 
undersampled based upon rarefaction 
analysis (data not shown). 
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fraction of rare sequences in the metagenomes was 0.4% 
and 22% for PfuTurbo and TaKaRa respectively. When 
the 'rare' sequences were removed from all samples in 
the rarefaction analysis, the TFF replicates showed the 
least diversity (Fig. 3B). 

The increased percentage of rare sequences, however, 
does not fully explain the difference in the rarefaction 
curves, as other replicates were also processed using the 
TaKaRa enzyme and had a similarly high fraction of rare 
sequences. To further explain the differences, we taxo- 
nomically classified (by superl<ingdom) the metagenome 
reads that were unique and shared in each method 
(Fig. 5). We found that 'rare' sequences (k-mer =1) in 
these anomalously diverse samples were enriched for 
'bacteria' (15-22% in TCI, TC2, TC3, FD2 versus 5-9% 
for all other samples) and suppressed for 'virus' (5-1 1 % in 
these four samples versus 9-20% for all other samples; 
Fig. 5A). In contrast, the category rankings among 'non- 
rare' (k-mer > 1 ) sequences did not vary across methods 



(Fig. 5B). We also found that the ratio of annotated viral to 
bacterial reads is less than one in the subset of reads 
specific to each replicate suspected of bacterial contami- 
nation (TCI, TC2, TC3, FD2) and much greater than this 
(commonly > 2) for the other treatments or replicates 
(Table 2). 

Further, when we mapped the metagenome reads to 
five abundant microbial genomes, we found that the three 
anomalously diverse TFF replicates (TCI , TC2 and TC3) 
contained more recruitment to and spread throughout 
three of the five microbial genomes (alphaproteobacte- 
rium BALI 99, and gammaproteobacteria HTCC 2143 and 
HTCC 2148) (Fig. S2). This pattern contrasted that 
observed for the other methods where little to no recruit- 
ment was observed except in the microbial metagenome 
where the level and spread of recruitment was similar 
(Fig. S2). Reads that mapped to the two other abundant 
microbial genomes displayed a different pattern: (i) 
alphaproteobacterium HIMB114 hits clustered in ~ 50 kb 
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region for all viral metagenomes across all methods, but 
not for the microbial metagenome while (ii) hits to Daphnia 
pulex were present in just two genes across all viral and 
microbial metagenomes (Fig. S2). We interpret these 
latter findings to represent prophages and viral-encoded 
AlVIGs (sensu; Breitbart etal., 2007), while the former 
findings suggest that 'rare' sequences in TC1, TC2 and 
TC3 represented trace microbial DNA contamination that 
inflated their rarefaction curves. Despite having an 
increased percentage of bacterial hits relative to the other 
replicates, FD2 did not show the same pattern of recruit- 
ment as TC1, TC2 and TC3 to the abundant microbes. 
Notably, however, FD2 had three to fourfold more hits to 
Riiodobacteriaceae as compared with FD1 and FD3 
(Fig. S2), and it is likely to be sporadically contaminated 
by bacteria (potentially during amplification) from this 
family or other less abundant bacteria not analysed here. 

Because some replicates had as many as 23% rare 
sequences, we sought to determine whether these 
sequences were 'real' or artefact. To do this, we examined 
two key features of their hits to protein clusters: percent 
identity and coverage. While rare sequences hit protein 
clusters with a lower-percentage identity than their more 
abundant counterparts, the coverage of the hits was 
consistent except in the case of FD2 where sequences 
with 100% coverage increased from 16% in abundant 
sequences to 27% among rare sequences. Although the 
rare sequences map at lower-percentage identity, we 
interpret this to be due to the fact that 'rare' organisms are 
not well represented in our protein clusters (required > 20 
members for bona fide clusters) or any database (most 
studies sample only the dominant organisms). Given 
these results, we posit that these rare sequences are 
indeed 'real'. 



Discussion 

Here we find that the taxonomic signal inferred from the 
annotatable portion of viral metagenomes is relatively 
robust to variations in commonly used sample concen- 
tration and purification procedures (Lawrence and 
Steward, 2010; Wommack etal., 2010) with deviations 
now well documented. When looking at a larger fraction 
of our data set using protein clustering (66% of reads) 
and k-mers (100% of reads), we find that our sampling 
effort is relatively deep and the majority of protein clus- 
ters and reads are shared with at least one other 
method (82% and 81%) and (50% and 53%) are shared 
among all. Thus, small-scale differences define these 
methods. 

To this end, we caution researchers on three fronts. 
First, if desiring bacteria-free viral metagenomes, then 
avoidance of TFF concentration and DNase-only purifi- 
cation methods may help. Even though all methods were 
relatively rigorous in extracting viruses, we observed 
microbial DNA contamination in trace amounts predomi- 
nantly in the 'rare' (k-mer=1) component of TFF- 
concentrated samples, as well as one replicate of the 
DNase-only purification samples. Second, if interested in 
studying particular viral taxa, care should be taken in 
choosing concentration and purification methods as 
some taxa, while not undersampled to the point of alter- 
ing rank abundance, may be underrepresented by some 
methods. We note here that our data sets were not rich 
in the larger eukaryotic viruses (Yoshida etal., 2011) that 
may be due to undersampling in this study as a result of 
a 0.2 urn prefiltration step to create the 'viral size frac- 
tion'. Further, it is possible that TFF may have allowed 
some smaller viruses (e.g. podoviruses and non-tailed 
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viruses) to pass through the lOOkDa filter. Empirical 
data from Pacific Ocean viral communities suggest that 
< 0.3% of the total viruses pass through these filters (J. 
Brum, pers. comm.). If all of viruses that passed through 
filtration were podoviruses this could explain the reduc- 
tion in podoviruses we found in the TFF samples as com- 
pared with other samples that also represented ~ 0.3% of 
reads in the TFF samples. Overall, viral loss in permeate 
should only minimally impact the taxonomic assess- 
ments described here, given that lOOkDa filter pore 
sizes are only - 10 nm and the smallest known ocean 
viruses are 20 nm. Tangential flow filtration concentration 
set-ups of 30 kDa and 50 kDa used in previous marine 
viral ecology studies may have further reduced loss of 
small viruses. 

Third, it is important to consider amplification options 
when preparing environmental viral metagenomes. The 
metagenomes in this study were generated from DNA 
that was linker-amplified using protocols optimized 
for quantitative metagenomics from next-generation 
sequencing (see companion paper Duhaime etal., 
2012). We document here that the choice of polymerase 
greatly impacts your access to 'rares' in the community, 
but in a systematic manner. Mechanistically, we specu- 
late that late cycle PGR dCTP deamination to dUTP 
inhibits amplification of dominant templates in the 
TaKaRa reactions thus selecting for rare templates not 
yet deaminated. Such issues are not encountered in the 
PfuTurbo reactions because it contains an enzyme to 
convert dUTP products to dUMP that allows dominant 
templates to be processed In proportions relative to their 
actual occurrence in the population. Further, while linker 
amplification methods have a slight systematic (%G+G) 
bias (Duhaime etal., 2012), they are incredibly precise 
as evidenced by minimal variation between replicates 
which allows for quantitative cross-comparison between 
samples. In contrast, other published viral metagenomic 
data sets suffer from methodological issues not recog- 
nized at the time of publication - either being small and 
biased by cloning (e.g. linker-amplified and Sanger- 
sequenced; Breitbart etal., 2002; 2003; 2004; Bench 
etal., 2007) or resulting from whole genome-amplified 
DNA (Angly etal., 2006; Dinsdale etal., 2008) which is 
now known to have unpredictable, stochastic biases that 
lead to non-quantitative metagenomic data sets (Yilmaz 
etal., 2010), as well as systematic biases of particular 
relevance to viruses (Kim etal., 2008; Kim and Bae, 
2011). 

Finally, the 'bacterial' signal in viral metagenomes pre- 
sents an area where informatics solutions are greatly 
needed. Specifically, viral metagenomes commonly 
contain up to -1/3 'bacterial' sequences (e.g. Breitbart 
etal, 2002; 2003; Angly etal., 2006; Bench etal., 2007; 
Gantalupo etal., 2011) which are loosely attributed to 
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'auxiliary metabolic genes' or host genes legitimately in 
viruses (sensu; Breitbart etal., 2007), prophages in 
microbial genomes that are yet to be annotated or micro- 
bial DNA that is mispacl<aged in viral capsids in elements 
called gene transfer agents (GTAs; Lang and Beatty, 
2007; Stanton, 2007; Biers etal., 2008). Here we present 
a new means to quantify the relative proportion of 
'bacterial' hits that are prophage through examining the 
distribution of reads mapping to abundant bacterial ref- 
erence genomes; we find that prophages contribute a 
relatively small fraction (only 1%) which is in line with the 
fraction of microbial genomes devoted to identifiable 
prophages (e.g. Casjens, 2003). While columns and 
reagents can be contaminated with low levels of bacteria 
and mouse sequences (van der Zee etal., 2002; Evans 
etal., 2003; Shen etal., 2006; Eriwein etal., 2011), non- 
uniform contamination across treatments and replicates 
argues against l<it-based contamination being respon- 
sible. Further, it is unlikely that contaminating DNA would 
have survived myriad purification methods across tripli- 
cate samples, which suggests that GTAs are the prob- 
able largest contributor to the 'bacterial' signal in these 
data. 

Conclusions 

The data and analyses presented here establish a quan- 
titative frameworl< for researchers to more rigorously 
understand and plan for biases in the sequence-based 
methods used to compare viral communities over space 
and time. Notably, however, our work is limited to dsDNA 
viruses and there likely remain many biases to be rigor- 
ously investigated in viral ecology. For example, new 
copurification methods allow simultaneous access to 
RNA and DNA viruses from the same sample (Andrews- 
Pfannkoch etal., 2010). While this is a giant step forward 
for sampling, it remains an open question whether the 
method effectively purifies RNA and DNA viruses in a 
manner that preserves the relative representation of 
these viruses from the wild. Further, myriad sequencing 
and library preparation options now exist for generating 
viral metagenomes that begs the question of their inter- 
comparability. With advancing sequencing and informat- 
ics technologies, quantitative evaluation becomes 
possible. Ultimately, as the field develops quantitative 
rigour with existing population level metrics, we will also 
migrate down the 'single-entity genomics' route (Allen 
etal., 2011) in the quest to map the population structure 
and quantify the relative abundance of viruses in wild 
communities. Our rigorous analysis of the current meth- 
odologies used for producing viral metagenomes comple- 
ments these single-cell genomics efforts towards 
obtaining a less biased view of community composition 
and protein diversity. 



Experimental procedures 

Isolation of nucleic acid from SIO seawater 
microbial fraction 

Approximately 200 I of seawater was filtered through a 
Whatman GF/D (2.7 |im) prefilter onto a Millipore Steripak 
GP20 (0.2 i^m) filter unit after which 10 ml of 0.2 |xm filtered 
sucrose lysis buffer (SLB, 50 mM TrisCI pH 8.0, 40 mM 
EDTA, 0.75 M sucrose) was added and the unit stored at 
-80°C until DNA extraction. Total nucleic acid was isolated 
using a modification of the protocol described in Frias-Lopez 
and colleagues (2008). Briefly, lysozyme (5 mg mM in SLB) 
was added to the thawed unit to a final concentration of 
0.5 mg m\-\ After incubation at 37°C for 30 min, 5 M NaCI 
(0.2 |xm filtered) was added to a final concentration of 0.2 M. 
Proteinase K (20 mg mM) was then added (final concentra- 
tion 0.5 mg mM) along with 10% SDS (0.2 ^m filtered, final 
concentration 1%) and the unit incubated at 55°C for 20 min 
followed by 70°C for 5 min. The lysate was removed (~ 14 ml) 
and extracted two times with phenol/chloroform (50:50 vol) 
equilibrated with TE followed by one extraction with chloro- 
form. Phases were separated by centrifugation at 4°C for 
5 min at 3320 g (4 K). Nucleic acid in the aqueous phase was 
concentrated using Amicon Ultra15 100 K MWCO filters (Mil- 
lipore) to approximately 600 |xl. Because the A260/280 ratio was 
below 1 .8 when analysed by nanodrop, the nucleic acid was 
further purified with one phenol/chloroform (50:50 vol) extrac- 
tion followed by one chloroform extraction. Thereafter, 3 M 
sodium acetate (0.2 |xm filtered) was added (0.3 M final con- 
centration) followed by 2.5 x volumes of 100% ethanol 
(0.2 urn filtered) to precipitate the nucleic acid overnight at 
-20°C. After centrifugation at 13 K for 20 min, the pellet was 
washed with 70% ethanol and air-dried. The pellet was resus- 
pended in a total volume of 600 |xl TE (10 mM TrisCI pH 8, 
1 mM EDTA, 0.2 ^m filtered). By nanodrop, the total microbial 
nucleic acid recovered was 1 .6 |xg with an OD260/280 of 1 .99 
and OD260/230 of 2.36. 

Collection, concentration and purification of viral 
community DNA 

Approximately 210 I of surface seawater was collected from 
Scripps Pier (La Jolla, CA, USA; 7 April 2009) and prefiltered 
using a 150 mm GF/A filter (Whatman International, Maid- 
stone, UK; Gat. #1820-150) and a 0.22 ^m, 142 mm Express 
Plus filter (Millipore, Bellerica, MA, USA; Cat. #GPWP14250) 
with 20-50 I of filtrate haphazardly pooled into a 55 gallon 
trashcan that held ~ 180 I at a time to minimize any potential 
variation between 20 I and 50 I carboys. The viruses in the 
filtrate were concentrated using either TFF or FeCIs precipi- 
tation (FeCIs), the latter as in John and colleagues (2011). For 
the TFF method, triplicate 50 I subsamples were separately 
concentrated using a large-scale, 100 kDa TFF (Amersham 
Biosciences, Westborough, MA, USA; Cat. #UFP-100-C-9A) 
to 0.65-1 I followed by a small-scale, 1 00 kDa TFF (Millipore, 
Bellerica, MA, USA; Cat. #PXB100C50) to 12-14 ml; viruses 
were collected in the retentate after a final washing step. For 
the FeCIs method, triplicate 20 I subsamples were subjected 
to a chemistry-based concentration method (John etal., 
2011) where FeCIs creates virus iron precipitates that can 
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be collected on 1.0 ixm polycarbonate filters (GE Water 
and Process Technologies, Trevose, PA, USA; Cat. 
#K10CP14220) and resuspended in magnesium-EDTA- 
ascorbate buffer (0.1 M MgjEDTA, 0.2 M ascorbic acid, 
pH 6.0) using 1 ml of buffer per 1 I of seawater. Resuspen- 
sion was allowed to go overnight, rotating in the dark at 4°C, 
and the filters were transferred to fresh tubes and centrifuged 
for 5 min at low speed to collect the remaining fluid. The 
efficiency of recovery of viruses ranged from 18-26% using 
TFF to 92-95% using FeCIs precipitation (John etai, 2011), 
as determined by SYBR Gold staining and epifluorescence 
microscopy (Noble and Fuhrman, 1998). 

The resulting FeCIs precipitation viral concentrates were 
purified using one of three methods - DNase only, 
DNase + CsCI or DNase + sucrose - with each 20 I split into 
three purification methods. In contrast, the TFF viral concen- 
trates were only purified using DNase + CsCI. The 'DNase- 
only' method consisted of 100 U ml"' DNase I (Roche, 
Indianapolis, IN, USA; Cat. #10-104-159-001) in reaction 
buffer (10 mM Tris-HCI pH 7.6, 2.5 mM MgClj, 0.5 mM CaClj) 
for 2 h at room temperature on a tube rotator; DNase I was 
inactivated by 100 mM (final concentration) of each EDTA 
and EGTA. The 'DNase + CsCI' method consisted of layering 
DNase l-treated viral concentrates on top of CsCI-step gra- 
dients [1 .7, 1 .56, 1.4, 1 .2 g mr' in 1 00 kDa seawater perme- 
ate that had been autoclaved and 0.02 |xm filtered, sensu 
(Thurber etal., 2009)], followed by centrifugation in a SW40ti 
rotor (Beckman) at 24 000 rpm (102 000 gf) for 4 h, 10°C; 
viruses were harvested from fractions with densities of 
1 .4-1 .52 g ml '. Finally, the 'DNase + sucrose' method con- 
sisted of a 38% (w/v) sucrose 'cushion' prepared in 0.2 ^m 
filtered SM buffer (50 mM Tris-HCI pH 7.5, 100 mM NaCI, 
8 mM MgS04) whereby DNase l-treated viral concentrate 
was layered on top at a ratio of one part sucrose to three 
parts viral concentrate. The tubes were centrifuged in a 
TH641 rotor (Sorvall) at 32 000 rpm (175 000 g) for 3 h, 
18°C. The pellets beneath the sucrose cushion were col- 
lected in Tris-EDTA buffer (TE, 10 mM Tris-HCI pH 7.6, 1 mM 
EDTA) containing 100 mM each EDTA and EGTA (S.J. Will- 
iamson, pers. comm.). 



Extraction and linker amplification of viral 
community DMA 

DNA was extracted from concentrated, purified viral particles 
using Wizard® PCR Preps DNA Purification Resin and Mini- 
columns (Promega, Madison, Wl, USA; Cat. #A7181 and 
A7211 respectively) as previously described (Henn etal., 
2010). 

The DNA was prepared for sequencing using a linker 
amplification (LA) protocol modified from Henn and col- 
leagues (2010). Briefly, DNA was sheared to a size of 400- 
800 bp using Covaris Adaptive Focused Acoustics (AFA) with 
the following conditions: 1 30 |xl of DNA in TE buffer with up to 
5 |xg of total DNA, duty cycle of 5%, intensity of 3, 200 cycles 
per burst, for 62 s (E210). The DNA was concentrated to 
35 |il using Microcon YM-100 centrifugal filter units (Millipore, 
Bellerica, MA, USA; Cat. #42412). The End-It DNA End- 
Repair kit (Epicentre Biotechnologies, Madison, Wl, USA; 
Cat. #ER 81050) was used to end repair the sheared DNA. 



After clean-up using the Min-Elute Reaction Clean-up kit 
(Qiagen Sciences, Germantown, MD, USA; Cat. #28204), the 
DNA was ligated to a hemi-phosphorylated adaptor 
(Linker-A) using the Fast-Link DNA Ligation kit (Epicentre 
Biotechnologies, Madison, Wl, USA; Cat. #LK 6201 H). The 
double-stranded Linker A was prepared by annealing the 
single-stranded forward oligonucleotide (5'-phosphorylated- 
GTA TGC TTC GTG ATC TGT GTG GGT GT-3') to the 
reverse oligonucleotide (5'-CCA CAC AGA TCA CGA AGC 
ATA C-3'). This was performed in TE (1 0 mM Tris-HCI pH 7.6, 
1 mM EDTA) buffer supplemented with 50 mM NaCI. The 
DNA solution was heated in a water bath to 100°C for 5 min. 
The water bath was allowed to cool to room temperature and 
the annealed linker DNA was then placed on ice for 5 min. 
Linker A was diluted to 10 |xM in nuclease-free water before 
use. After ligation, the DNA was immediately cleaned up 
using the MinElute Reaction Clean-up kit. Linker-ligated DNA 
was mixed with 6 x Blue/Orange Loading Dye (Promega, 
Madison, Wl, USA; Cat. #G1 881 ) and size-fractionated by gel 
electrophoresis in 1.5% SeaKem GTG agarose (Lonza, 
Rockland, MD, USA; Cat. #50071) prepared in sterile TAE 
(40 mM Tris-acetate, 2 mM EDTA) buffer and run at 80 V for 
90 min. DNA markers placed in the outermost lanes were 
either Quick Load 100 bp DNA Ladder (New England 
Biolabs, Ipswich, MA, USA; Cat. #N0467S) or 1 kB-Plus DNA 
Ladder (Invitrogen, Carlsbad, CA, USA; Cat. #10787-026). 
The marker lanes were stained with ethidium bromide 
(1 ng (il ') for 20 min and were used to as a guide to excise 
DNA in the range of 400-800 bp. DNA was extracted from the 
agarose slice using the Min-Elute Gel Extraction kit (Qiagen 
Sciences, Germantown, MD, USA; Cat. #28604). 

The base sequence of the PCR phos-A primer used for 
amplification was 5'p-CCACACAGATCACGAAGCATAC-3'. 
Because multiple sources of DNA were to be pooled prior to 
library preparation for 454 pyrosequencing, 5 bp barcodes 
were introduced at the 5' end of the primer so that the 
sequences from different sources could be identified from 
the data. The PfuTurbo Hotstart system (Stratagene, La 
Jolla, CA, USA; Cat. #600600) was used for amplification 
reactions. Reaction conditions were: 1-2 (xl of Linker-A 
ligated, size-fractionated DNA, 12.5 |xl of PfuTurbo Hotstart 
2X Master Mix (0.1 U PfuTurbo per microlitre), 0.5 |xl 
(5 pmol) of the 10|xM PCR phos-A primer, brought up to 
25 |xl with nuclease-free water (Promega, Madison, Wl, 
USA; Cat. #P1193). Thermocycling conditions were dena- 
turation at 95°C for 2 min, cycling for 25 to 30 cycles using 
95°C for 30 s, 60°C for 60 s and 72°C for 90 s, and a final 
extension at 72°C for 10 min. Products were analysed on 
1 .5% agarose gels containing 0.5 ng |xM ethidium bromide, 
run in TAE buffer at 90 V for 30 min, using 5 |xl of DNA. 
Amplified DNA was recovered from the PCR reaction mixes 
using the MinElute PCR Purification kit (Qiagen Sciences, 
Germantown, MD, USA; Cat. #28004) according to the 
manufacturer's directions. DNA was eluted off the mini- 
columns using 25-40 |xl of the provided EB buffer warmed 
to 80°C. DNA was quantified using the Quant-iT Pico Green 
dsDNA assay kit (Invitrogen, Carlsbad, CA, USA; Cat. 
#P7589). Prior to sequencing library preparation, samples 
were pooled in equal amounts (generally considered 
molar equivalents due to shearing and size-fractionation 
steps). 
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Table 3. Total sequences, QCed sequences and hits to SIMAP for each replicate in four viral metagenomic methods and a non-replicated 
microbial metagenome. 



Viral 

concentration 
method 


Viral 

purification 
method 


Replicate 
number 


Total 

sequences 


QCed 
sequences 


Hits to 
Ql^/l A p 

OlMrtr 


% hits 

lO ollvlrtr 


FeCIa 


DNase 


1 


1 72 745 


1 34 504 


41 582 


30.9 


FeCl3 


DNase 


2 


345 858 


236 591 


60 428 


25.5 


FeCl3 


DNase 


3 


428 258 


274 368 


79 238 


28.9 


FeCIa 


DNase + CsCI 


1 


175 121 


141 000 


44 119 


31.3 


FeCIa 


DNase + CsCI 


2 


218 300 


175 119 


59 322 


33.9 


FeCIa 


DNase + CsCI 


3 


261 763 


171 220 


52 703 


30.8 


FeCIs 


DNase + sucrose 


1 


162 078 


122 151 


39 724 


32.5 


FeCIs 


DNase + sucrose 


2 


216 304 


158 816 


52 786 


33.2 


FeCIa 


DNase + sucrose 


3 


308 139 


223 859 


43 436 


19.4 


TFF 


DNase + CsCI 


1 


421 395 


308 510 


62 979 


20.4 


TFF 


DNase + CsCI 


2 


264 916 


193 113 


60 408 


31.3 


TFF 


DNase + CsCI 


3 


436 484 


319 781 


94 926 


29.7 


Microbial 






136 227 


100 704 


63 252 


62.8 



Sequencing viral community DMA 

Metagenomic sequencing was performed using a GS-FLX 
Titanium system (454 Life Sciences, Roche, Mannheim, 
Germany) at the Broad Institute, Duke Institute for Genome 
Sciences and Policy, and the University of Arizona Genetics 
Core facility to generate ~ 3.5 million reads for the 12 viral 
metagenomes and one microbial metagenome described 
above (Table 3). Reads were filtered to remove low-quality 
data based on protocols suggested by Huse and colleagues 
(2007). Briefly, reads were removed if they contained an 'N' 
anywhere in the sequence, were longer or shorter than two 
standard deviations from the mean sequence length or had a 
mean quality score less than two standard deviations from 
the mean quality score for all reads. Lastly, artificial dupli- 
cates from the pyrosequencing runs were removed using 
cd-hit-454 from the cd-hit package version 4.5.5 with default 
parameters (Niu etal., 2010). Based on these criteria, ~ 2.6 
million high-quality reads were retained for further analysis 
(Table 3). All sequences were deposited to CAMERA (http:// 
camera.calit2.net) under the Project Accession Number 
CAM_P_0000914. 



Taxonomic assignment of reads 

Sequences that passed quality control were compared 
against the Similarity Matrix of Proteins (SIMAP) released on 
25 June 2011 (Rattei etal., 2006) using BLASTX (Altschul 
etal., 1997). Hits were considered significant if they had an 
E-value of < 0.001 . Although we did not impose a hit length 
cut-off, the majority of matches (99%) were between 100 and 
500 bp in length with a mean length of 248 bp. To constrain 
our analyses to known organisms in SIMAP, the top 10 hits 
were analysed and sequences were assigned to the top hit 
that was not from an 'uncultured' organism. Taxonomic data 
at the species level were assigned based on the SIMAP hit, 
where the read inherited the taxonomy ID associated with the 
protein from SIMAP. Taxonomy data at the superfamily, family 
and genus levels were acquired using the NCBI taxonomy 
based on the species taxonomy ID. A subset of the NCBI 
taxonomy records for the most abundant viruses in the 
samples were manually curated to fill in missing data on the 



family and genus levels (http://www.eebweb.arizona.edu/ 
faculty/mbsulli/scripts/hurwitz). We also updated NCBI taxa in 
our analyses for abundant bacteria that were missing anno- 
tation at the family level by assigning 'Candidatus Pelagi- 
bactef to the family ' Ricketsiaceae' and 'Synechococcus' to 
the family 'Synechococcaceae' . 

Prophage and AMG detection 

To differentiate true bacterial hits from prophage regions in 
bacterial genomes or AMGs, we applied a secondary filter on 
sequences whose best match was bacterial in SIMAP. To do 
this, we constructed an in-house prophage database by com- 
paring marine microbial genomes from the Gordon and Betty 
Moore Foundation to genes in viral genomes in NCBI and 
running Prophage Finder (http://bioinformatics.uwp.edu/ 
-phage/help.htm). The viral gene sequences were down- 
loaded from NCBI on 7 July 201 1 , resulting in a total of 36 603 
genes. The marine microbial genomes were downloaded 
from CAMERA (http://camera.calit2.net) on 7 July 2011 for a 
total of 19 459 genomic sequences. The microbial genomes 
were compared with the phage genes using BLASTX with an 
£-value cut-off of < 0.001 and a total of ~ 1 million possible 
matches and alignments reported. Prophage regions were 
detected using Prophage Finder with default parameters. A 
total of 366 345 prophage protein sequences from 10 703 
DNA fragments were found using this approach. To supple- 
ment these data, we also included data from Itai Sharon and 
colleagues on AMG sequences (Yooseph etal., 2007; 
Sharon ef a/., 2011). 

Statistical analyses of taxonomic distributions 

Multiple statistical tests were conducted using a one-way 
analysis of variance (anova) to evaluate the relationship 
between viral concentration and purification methods and 
taxonomic groups. A separate test was performed for each of 
the top 10 taxonomic groups at the family and genus levels 
and all at the superkingdom level using Matlab (anoval) 
(Table SI). In each case, the independent variable was the 
method, and the dependent variable was the number of hits 
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to the taxonomic group being tested. Because each of the 
methods and replicates had a variable number of total 
sequences, we normalized the hit count prior to our analysis 
by dividing the hits by the total sequences in the replicate and 
multiplying by 196 000 (the average number of sequences 
per library in 1000s). If the anova was deemed to be signifi- 
cant (P-value was < 0.05), we performed follow-up tests 
using the Tukey HSD test in Matlab (multcompare) to evalu- 
ate pairwise differences and identify means that differed 
between methods. 



Microbial genomic recruitment plots 

To investigate whether hits to microbial genomes were from 
prophage or AMGs or sporadic microbial contamination, we 
created genomic recruitment plots for five abundant 
microbes: alphaproteobacterium HIMB114, Daphnia pulex, 
alphaproteobacterium BAL199, and gammaproteobacteria 
HTCC 2143 and HTCC 2148. To do this, we compared 
sequences whose best BLAST match was to the aforemen- 
tioned genomes to the contig sequences from each respec- 
tive genome. Each hit was required to match with an E-value 
of < 0.001 and only the top match was retained. We used a 
Matlab script provided by Maureen Coleman to plot the blast 
data for the reads along a reference genome and calcu- 
late the coverage (http://www.eebweb.arizona.edu/faculty/ 
mbsulli/scripts/hurwitz). Prophage regions were differentiated 
from AMGs because read coverage in these regions spanned 
> 4 kb in the microbial genome and was not confined to a 
single gene. Both AMG and prophage regions could be dif- 
ferentiated from sporadic contamination from microbial 
genomes based on a lack of alignment to the rest of the 
genome. 

K-mer analysis for discovering 'rare' sequences 

Rare sequences (k-mer =1) were distinguished from more 
abundant sequences (k-mer >1) in the samples using 
vmatch version 2.1.5 (http://www.vmatch.de/). Specifically, 
we used mkvtree to create a suffix array for each sample, and 
then used vmerstat to search for the frequency of 20-mers in 
each of our metagenomic sequences with a minimum occur- 
rence of 2 as compared with other sequences in the same 
sample. We parsed the vmatch data using a PERL script to 
assign a single frequency to each sequence based on the 
mode k-mer frequency of all of its 20-mer subsequences. The 
high-throughput data-processing pipeline containing the 
scripts for running these analyses is available here (http:// 
www.eebweb.arizona.edu/faculty/mbsulli/scripts/hurwitz). 

Protein clustering and rarefaction 

In order to find proteins, reads with a k-mer frequency > 1 
were assembled into larger contigs using velvet version 
1.0.15 (hash length = 29, -long) (Zerbino and Birney, 2008). 
Open reading frames were determined both on the individual 
reads (Table 3) and in assembled contigs using the metage- 
nomic mode in Prodigal (Hyatt et al., 2010). Only non- 
redundant ORFs > 60 amino acids in length were retained. 
Protein sequences were clustered based on homology using 



cd-hit-v4.5.5-201 1-03-31 (Niu etal., 2010) in a two-step 
process. First, we downloaded core cluster GOS (Yooseph 
et al., 2007) proteins from CAMERA (http://camera.calit2.net) 
and recruited sequences to known GOS protein clusters 
using cd-hit-2d ('-g 1 -n 4 -d 0 -T 24 -M 45000'). Sequences 
were considered to have a match if they hit with > 60% 
identity and > 80% coverage to the smallest sequence. 
Sequences that did not recruit to GOS protein clusters were 
then self-clustered using cd-hit with the same parameters as 
above. In total, our reads mapped to 11 116 GOS clusters 
and 61 78 novel clusters with greater than 20 members. When 
we subtracted out data from the SIO microbial data set, our 
reads mapped to 6845 GOS clusters and 6178 novel clus- 
ters, with 449 980 and 210 451 reads respectively. The clus- 
tering pipeline containing the scripts for running these 
analyses is available here (http://www.eebweb.arizona.edu/ 
faculty/mbsulli/scripts/hurwitz). 

We compared all of the high-quality metagenomic reads in 
our data set (Table 3) with the sequences in the 20 + protein 
clusters using BLASTX (£-value < 0.001). Based on these 
data, we generated hit counts to the protein clusters and used 
the data for further rarefaction analysis using the rarefaction 
calculator (http://www.biology.ualberta.ca/jbrzusto/rarefact. 
php). 
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Supporting information 

Additional Supporting Information may be found in the online 
version of this article: 

Fig. SI. Superkingdom taxonomic profile across triplicate 
samples from each of the four viral metagenome methods 
and a non-replicated microbial metagenome. Note that these 
data represent only those metagenomic reads that had a 
significant hit to the SIMAP database. 
Fig. S2. Fragment recruitment plots that show the distribu- 
tion of metagenomic sequence reads that map to five abun- 
dant microbial genomes. 

Table SI. The results of a one-way analysis of variance for 
the top taxonomic levels based on rank abundance in the 
samples by superkingdom, family and genus. Significant 
results are shown in bold. Abbreviations are used for each 
method: FC = FeCl3 CsCI + DNase, FS = FeCIa sucrose + 
DNase, FD = FeCIs DNase, TC = TFF and CsCI + DNase and 
TF = TFF CsCI + DNase. 
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